Add `assert_count_true()` to verify that an expected number of values are `TRUE` #573

billdenney · 2024-05-22T14:04:57Z

Related to #539

I use a variant of this often in my data pipelines where I want to select a certain number of values and be sure that I'm only getting the number I expect. For example, I want to change one date for one person, and I want to be sure that I only change that one date:

# We are very young
my_data <- data.frame(name = c("Bill", "Sam"), birthdate = c("2024-05-22", "2024-05-22"))
my_data |>
  dplyr::mutate(
    birthdate =
      dplyr::case_when(
        assert_count_true(name == "Bill" & birthdate == "2024-05-22") ~ "2024-05-23",
        TRUE ~ birthdate
      )
  )

… are `TRUE`

sfirke · 2024-11-25T15:25:31Z

@billdenney thanks for this and sorry to take so long to review!

Three things come to mind after reviewing:

1. Function name
Would it be cleaner to call this just assert_count ? I think the true is clear enough from the usage, but I don't feel strongly if you prefer the current way.

2. Providing a before-and-after example

I get the value of this function and would use it. But it took me reading each code block to understand its value. I think we can make it clearer upfront in the documentation (package vignette and function docs) that this doesn't change anything in the data; it's a wrapper that performs a data quality check mid-pipeline and halts things if an assertion fails.

The best way I can think of is to show a scenario with before and after example. For instance:

Imagine we have a column in a dataset that contains student test scores. A few scores were entered into the grade platform after the data file was exported, so we're filling them in manually during data cleaning. Ordinarily we might do that like this:

clean <- 
  raw |>
  dplyr::mutate(
    test_score =
      case_when(
        student_id == 12345 & is.na(test_score) ~ 84,
        student_id == 67890 & is.na(test_score) ~ 91,
        TRUE ~ test_score
      )
  )

But this has a risk: if in a later version of the data, one of these students is missing two test scores, the value will be imputed for both cases. assert_count_true defends against that. It wraps the statements, doing nothing if the assumed count is matched and throwing an error if not.

clean <- 
  raw |>
  dplyr::mutate(
    test_score =
      case_when(
        assert_count_true(student_id == 12345 & is.na(test_score), count = 1) ~ 84,
        assert_count_true(student_id == 67890 & is.na(test_score), count = 1) ~ 91,
        TRUE ~ test_score
      )
  )

3. janitor's approach in wading into the assertion space
I may be overthinking a little here but this feels like the start of a separate chapter, even if it's only one case for now. One implication is that I think we could carve out a little separate space in terms of function organization and documentation. Like this could go in a file assertions.R.

I am also considering how janitor complements the existing assertion functions out there. I looked at the assertr vignette and this blog post on 4 ways to be assertive in R.

To me the big difference is that assert_count_true does not take its own line in the pipeline, it wraps an existing line. I didn't see that in the other packages. You could make these same checks with assertr::assert() but at a cost of needing to write two lines for each replacement case: one to assert-check, one to replace. With your function, it's one line for each replacement, so there's less code to read and maintain.

I would argue this niche is unfilled as far as I know and that's this function's reason for being. Do you agree? If so maybe we briefly note that in the function or package docs so people understand when/why to use it.

billdenney · 2024-11-26T03:46:25Z

@sfirke:

For point 2, yes, I like the example. I will incorporate it in some way (maybe a new assertion vignette).

For points 1 and 3, you understand all the ideas that I'm looking toward. I was thinking of the checkmate library which uses assert_* to ensure that aspects of input arguments are true, and Danielle Navarro's blog post (aside: she has an amazing blog!) shows other ways that this makes sense. I thought that it could be a good fit for janitor specifically because of my typical use case which you succinctly described in point 2. I use it to clean dirty data that may change over time or that I want to be sure I only capture the expected number of rows for, even if it doesn't change over time.

I agree that this would be a new chapter for janitor, and it deserves a new organizational layer. I'll work those in over the next few days. I also agree that the _true part of the function name can be assumed (and therefore removed from the function name).

sfirke · 2024-12-16T16:33:11Z

Hi @billdenney - friendly ping here. I am gonna try to submit to CRAN by the end of the year, I got another email from them this month saying it was my final notice to fix the package and resubmit. Think you'll be able to implement the above feedback? If not we could try for a fast-follow release and get a few more outstanding things in Spring 2025.

…true # Conflicts: # NEWS.md

…into assert_count_true

billdenney · 2024-12-18T22:07:11Z

@sfirke, I think that I covered all the to-do items here. Please take a look.

Also, I found a bug in either the commonmark or xml2 package while trying to add spell checking to the package. So, spell checking is only partially added (that's unrelated to the main purpose of the PR, but it seemed helpful sorry for making this a bigger change to review).

codecov · 2024-12-18T22:12:38Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (81702b6) to head (1cb9c4b).

Additional details and impacted files

@@            Coverage Diff            @@
##              main      #573   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           27        28    +1     
  Lines         1413      1428   +15     
=========================================
+ Hits          1413      1428   +15

Files with missing lines	Coverage Δ
R/assertions.R	`100.00% <100.00%> (ø)`
R/clean_names.R	`100.00% <100.00%> (ø)`
R/round_half_up.R	`100.00% <ø> (ø)`

# Conflicts: # .gitignore

billdenney · 2024-12-23T14:21:14Z

I merged the current main branch back in here. This should be ready to review again. When it's merged, we can head over to the bigger package separation PR.

billdenney added 2 commits May 22, 2024 10:00

Add assert_count_true() to verify that an expected number of values…

7a8e651

… are `TRUE`

Fix pkgdown; add another example

030a03d

billdenney requested a review from sfirke May 22, 2024 14:20

sfirke added this to the v2.3 milestone Nov 25, 2024

billdenney and others added 6 commits December 18, 2024 16:43

Move function name to assert_count(); add assertions vignette

5b4c1fe

Merge remote-tracking branch 'remotes/origin/main' into assert_count_…

de85e90

…true # Conflicts: # NEWS.md

Clean up filename

140d619

Style code (GHA)

0edd037

Merge branch 'assert_count_true' of https://github.com/sfirke/janitor …

42199d8

…into assert_count_true

Fix spelling issues, update janitor.md

f9ec0eb

Merge branch 'main' into assert_count_true

1cb9c4b

# Conflicts: # .gitignore

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `assert_count_true()` to verify that an expected number of values are `TRUE` #573

Add `assert_count_true()` to verify that an expected number of values are `TRUE` #573

billdenney commented May 22, 2024

sfirke commented Nov 25, 2024

billdenney commented Nov 26, 2024

sfirke commented Dec 16, 2024

billdenney commented Dec 18, 2024

codecov bot commented Dec 18, 2024 •

edited

Loading

billdenney commented Dec 23, 2024

Add assert_count_true() to verify that an expected number of values are TRUE #573

Are you sure you want to change the base?

Add assert_count_true() to verify that an expected number of values are TRUE #573

Conversation

billdenney commented May 22, 2024

sfirke commented Nov 25, 2024

billdenney commented Nov 26, 2024

sfirke commented Dec 16, 2024

billdenney commented Dec 18, 2024

codecov bot commented Dec 18, 2024 • edited Loading

Codecov Report

billdenney commented Dec 23, 2024

Add `assert_count_true()` to verify that an expected number of values are `TRUE` #573

Add `assert_count_true()` to verify that an expected number of values are `TRUE` #573

codecov bot commented Dec 18, 2024 •

edited

Loading